Abstract



We introduce a method for aggregating many least squares estimators so that the resulting
estimate has two properties: sparsity and structure. That is, only a few candidate covariates
are used in the resulting model, and the selected covariates follow some structure over the
candidate covariates that is assumed to be known a priori. While sparsity is well studied
in many settings, including aggregation, structured sparse methods are still emerging. We
demonstrate a general framework for structured sparse aggregation that allows for a wide
variety of structures, including overlapping grouped structures and general structural penalties
defined as set functions on the set of covariates. We show that such estimators satisfy
structured sparse oracle inequalities: their finite sample risk adapts to the structured sparsity
of the target. These inequalities reveal that under suitable settings, the structured sparse
estimator performs at least as well as, and potentially much better than, a sparse aggregation
estimator. We empirically establish the effectiveness of the method using simulation and an
application to HIV drug resistance.

Keywords: Sparsity, Variable Selection, Aggregation, Sparsity Oracle Inequalities, HIV
Drug Resistance

1 Introduction



In statistical learning, sparsity and variable selection are well studied and fundamental topics. 
Given a large set of candidate covariates, sparse models use only a few in the model. Sparse 
techniques often improve out of sample performance and aid in model interpretation. We 
focus on the linear regression setting. Here, we model a vector of responses $y$ as a linear
combination of $M$ predictors, represented as an $n \times M$ data matrix $X$, via the equation
$y = X\beta + \epsilon$, where $\beta$ is a vector of linear coefficients and $\epsilon$ is a vector of stochastic noise.
The task is then to produce an estimate of $\beta$, denoted $\hat\beta$, using $X$ and $y$. Sparse modeling
techniques produce a $\hat\beta$ with only a few nonzero entries, with the remaining entries set equal to zero,
effectively excluding many covariates from the model. One example of a sparse regression
method is the lasso estimator (Tibshirani, 1996):

$$\hat\beta^{\mathrm{lasso}} = \arg\min_{\beta} \|y - X\beta\|_2^2 + \lambda \sum_{j=1}^{M} |\beta_j|. \qquad (1)$$

In the above, $\lambda > 0$ is a tuning parameter. Here, the $\ell_1$ penalty encourages many entries of
$\hat\beta^{\mathrm{lasso}}$ to be identically zero, giving a sparse estimator.

Suppose now that additional structural information is available about the covariates. 
We then seek to incorporate this information in our sparse modeling strategy, giving a 
structured, sparse model. For example, consider a factor covariate with $u$ levels, such as
in an ANOVA model, encoded as a set of $u - 1$ indicator variables in $X$. Taking this
structure into account, we would then jointly select or exclude this set of covariates from
our sparse model. More generally, suppose that we have a graph with $M$ nodes, each node
corresponding to a covariate. This graph might represent a spatial relationship between the
covariates. A sparse model incorporating this information might jointly include or exclude
sets of predictors corresponding to neighborhoods or cliques of the graph. In summary,
sparsity seeks a $\hat\beta$ with few nonzero entries, whereas structured sparsity seeks a sparse $\hat\beta$
where the nonzero entries follow some a priori defined pattern.

As an example, consider the results displayed in Figure 2. In the top left, we see a
coefficient vector, rearranged as a square matrix. The nonzero entries, represented as white 
squares, have a clear structure with respect to the familiar two dimensional lattice. On the 
bottom row, we display the results of two sparse methods, including the lasso. The top right 
panel displays the results of one of the methods of this paper. Since our method also takes 
the structural information into account, it is able to more accurately re-create the sparsity 
pattern pictured in the top left. 

Though methods for structured sparsity are still emerging, there are now many examples 
in the literature. The grouped lasso (Yuan and Lin, 2006) allows for joint selection of 
covariates, where the groups of covariates partition the set of covariates. Subsequent work
(Huang, Zhang and Metaxas, 2009; Jacob, Obozinski and Vert, 2009; Jenatton, Obozinski
and Bach, 2010) extended this idea to allow for more flexible structures based on overlapping
groups of covariates. Further, Bach (2008) and Zhao, Rocha and Yu (2009) proposed methods
for hierarchical structures, and Kim and Xing (2010) as well as Peng et al. (2010) gave
methods in the multi-task setting for coherent variable selection across tasks.

In this paper, we present an aggregation estimator that produces structured sparse models.
In the linear regression setting, aggregation estimators combine many estimates of $\beta$,
$\{\hat\beta_1, \ldots, \hat\beta_K\}$, in some way to give an improved estimate $\hat\beta^{\mathrm{Aggregate}}$. See Bunea, Tsybakov
and Wegkamp (2007) and the references therein for discussions of aggregation in general
settings, and Yang (2001a,b) for methods in the linear regression setting. In particular, we
extend the methods and results given by Rigollet and Tsybakov (2010), who focused on
sparse aggregation, where the estimated $\hat\beta^{\mathrm{Aggregate}}$ has many entries equal to zero. Their
sparse aggregation method combines in a weighted average the least squares estimates for
each subset of the set of candidate covariates. For a particular model in the average, its
weight is, in part, inversely exponentially proportional to the number of covariates in the
model, i.e. the sparsity of the model. This strategy encourages a sparse $\hat\beta^{\mathrm{Aggregate}}$. We
extend this idea by proposing an alternate set of weights that instead depend on the
structured sparsity of the sparsity patterns, accordingly encouraging a structured sparse
$\hat\beta^{\mathrm{Aggregate}}$.

We give extensions that cover a wide range of possible structure-inducing strategies. These
include overlapping grouped structures and structural penalties based on hierarchical
structures or arbitrary set functions. These parallel many convex methods for structured sparsity
from the literature; see Section 3. Though structure can be useful for interpretability, we
must consider whether injecting structure into a sparse method has a beneficial impact under 
reasonable conditions. In this paper we demonstrate that our estimators perform no worse 
than sparse estimators when the true model is structured sparse. In the group sparsity 
case, they can give dramatic improvements. These results hold for a very general class of 
structural modifications, including overlapping grouped structures. 

We first review the sparse aggregation method of Rigollet and Tsybakov (2010)
in Section 2. In Section 3, we discuss our methods for structured sparse aggregation. We
introduce two settings: structurally penalized sparse aggregation (Section 3.1), and group
structured sparse aggregation (Section 3.2). We present the theoretical properties of these
estimators in Section 4. We then present a simulation study and an application to HIV drug
resistance in Section 5. We finally give some concluding remarks and suggestions for future
directions in Section 6. Proofs of general versions of the main theoretical results are given
in the supplementary material.



2 Sparsity Pattern Aggregation 



The sparse aggregation method of Rigollet and Tsybakov (2010) builds on the results of
Leung and Barron (2006). The method creates an aggregate estimator from a weighted average
of the $2^M$ ordinary least squares regressions on all subsets of the $M$ candidate covariates. The
method encourages sparsity by letting the weight in the average for a particular model
increase as the sparsity of the model increases. We first establish our notation and setting, and
then present the basic formulas behind the method. We finally discuss its implementation
via a stochastic greedy algorithm.



2.1 Settings and the Sparsity Pattern Aggregation Estimator 

We consider the linear regression model:

$$y = X\beta + \epsilon. \qquad (2)$$

Here, we have a response $y \in \mathbb{R}^n$, an $n \times M$ data matrix $X = [x_1, \ldots, x_M]$, where $x_j \in \mathbb{R}^n$,
and a vector of coefficients $\beta \in \mathbb{R}^M$. From here on, we assume that $X$ is normalized so that
$\|x_j\|_2 \le 1\ \forall j$. The entries of the $n$-vector of errors $\epsilon$ are i.i.d. $N(0, \sigma^2)$. Assume that $\sigma^2$ is
known. Let $\|\cdot\|_p$ denote the $\ell_p$ norm for $p \ge 1$. Let $\mathrm{supp}(\cdot)$ denote the support of a vector,
the set of indices for which the entries are nonzero. Denote $\|\cdot\|_0 = |\mathrm{supp}(\cdot)|$ as the $\ell_0$ norm.
Let the set $\mathcal{I} = \{1, \ldots, M\}$ index the set of candidate covariates.

Define the set $\mathcal{P} = \{0, 1\}^M$; $|\mathcal{P}| = |2^{\mathcal{I}}| = 2^M$. $\mathcal{P}$ encodes all sparsity patterns over our
set of candidate covariates: the $i$th element of $p \in \mathcal{P}$ is 1 if covariate $i$ is included in the
model, and 0 if it is excluded. Let $\hat\beta_p$ be the ordinary least squares solution restricted to
the sparsity pattern $p$:

$$\hat\beta_p = \arg\min_{\beta \in \mathbb{R}^M :\ \mathrm{supp}(\beta) \subseteq \mathrm{supp}(p)} \|y - X\beta\|_2^2. \qquad (3)$$

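Define the training error of an estimate $\hat\beta$ to be:

$$\mathrm{Error}(\hat\beta) = \|y - X\hat\beta\|_2^2. \qquad (4)$$

Then, the sparsity pattern aggregate estimator coefficients are defined as:

$$\hat\beta^{\mathrm{SPA}} := \frac{\sum_{p \in \mathcal{P}} \hat\beta_p \exp\left(-\frac{1}{4\sigma^2}\mathrm{Error}(\hat\beta_p) - \frac{\|p\|_0}{2}\right)\pi_p}{\sum_{p \in \mathcal{P}} \exp\left(-\frac{1}{4\sigma^2}\mathrm{Error}(\hat\beta_p) - \frac{\|p\|_0}{2}\right)\pi_p}. \qquad (5)$$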

Here, we obtain $\hat\beta^{\mathrm{SPA}}$ by taking a weighted average over all sparsity patterns $p$. The weights
in this average are a product of an exponentiated unbiased estimate of the risk and a prior,
$\pi_p$, over the sparsity patterns. This strategy is based on the work of Leung and Barron
(2006), who demonstrated that this form results in several appealing theoretical properties, which
form the basis of the theory of Rigollet and Tsybakov (2010) and our own methods. Rigollet
and Tsybakov (2010) consider the following prior:

$$\pi_p := \begin{cases} \frac{1}{H}\left(\frac{\|p\|_0}{2eM}\right)^{\|p\|_0} & \|p\|_0 < R, \\ \frac{1}{2} & \|p\|_0 = M, \\ 0 & \text{else.} \end{cases} \qquad (6)$$

Here, $H$ is a normalizing constant and $R = \mathrm{rank}(X)$. The above prior places exponentially
less weight on sparsity patterns as their $\ell_0$ norm increases, up-weighting sparse models. The
weight of 1/2 on the OLS solution is included for theoretical calculations; in practice this
case is treated as other cases (see the supplementary material). This specific choice of prior
has many theoretical and computational advantages. In Section 3, we consider modifications
to the prior weight to encourage both structure and sparsity.
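
To make the construction concrete, the following Python sketch computes the aggregate exactly by enumerating all $2^M$ sparsity patterns, which is feasible only for very small $M$. It is an illustration under stated assumptions rather than the implementation used in this paper: it assumes the prior takes the form $\pi_p = H^{-1}(\|p\|_0/(2eM))^{\|p\|_0}$ for $\|p\|_0 < R$ with weight 1/2 on the full model, it relies on NumPy, and the names `spa_prior` and `spa_exact` are hypothetical.

```python
import itertools
import math

import numpy as np


def spa_prior(M, R):
    """Return a function k -> prior weight of a pattern with k active covariates."""
    # H is chosen so that the patterns with fewer than R covariates carry total mass 1/2.
    H = 2.0 * sum(math.comb(M, k) * (k / (2 * math.e * M)) ** k for k in range(R))

    def pi(k):
        if k == M:
            return 0.5  # weight placed on the full least squares model
        if k < R:
            return (k / (2 * math.e * M)) ** k / H  # Python evaluates 0.0 ** 0 as 1.0
        return 0.0

    return pi


def spa_exact(X, y, sigma2):
    """Exact sparsity pattern aggregate by brute-force enumeration (small M only)."""
    n, M = X.shape
    R = int(np.linalg.matrix_rank(X))
    pi = spa_prior(M, R)
    log_weights, betas = [], []
    for p in itertools.product([0, 1], repeat=M):
        k = sum(p)
        prior = pi(k)
        if prior == 0.0:
            continue  # patterns with zero prior mass do not enter the average
        idx = [j for j in range(M) if p[j]]
        beta = np.zeros(M)
        if idx:
            # Least squares fit restricted to the covariates in the pattern.
            beta[idx] = np.linalg.lstsq(X[:, idx], y, rcond=None)[0]
        err = float(np.sum((y - X @ beta) ** 2))
        log_weights.append(-err / (4 * sigma2) - k / 2 + math.log(prior))
        betas.append(beta)
    log_weights = np.array(log_weights)
    w = np.exp(log_weights - log_weights.max())  # subtract the max for stability
    w /= w.sum()
    return w @ np.array(betas)
```

Working with log-weights avoids underflow when the training errors are large relative to $\sigma^2$; the normalization in the last lines plays the role of the denominator of the aggregate.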



2.2 Computation 



Exact computation of the sparsity pattern aggregate estimator is clearly impractical, since it
would require fitting $2^M$ models. Rigollet and Tsybakov (2010) give a Metropolis-Hastings
stochastic greedy algorithm based on work by Alquier and Lounici (2010) for approximating
the sparsity pattern aggregate; the procedure is reviewed in the supplement. The procedure
performs a random walk over the hypercube of all sparsity patterns. Beginning with an empty
model, in each step, one covariate is randomly selected from the candidate set, and proposed
to be added to the model if it is not already in the current model, or to be removed from the
current model otherwise. These proposals are accepted or rejected using a Metropolis step,
with probability related to the product of the difference in risk and the ratio of prior weights.
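
The random walk just described can be sketched in Python as follows. This is a minimal illustration, not the procedure from the supplement: the acceptance rule $\min(1, \nu(q)/\nu(p))$ with $\nu(p) \propto \exp(-\mathrm{Error}(\hat\beta_p)/(4\sigma^2) - \|p\|_0/2)\,\pi_p$ is assumed, patterns with $\|p\|_0 \ge R$ are given zero prior mass, and the names `log_weight`, `sparsity_log_prior`, and `mh_aggregate` are hypothetical. NumPy is assumed to be available.

```python
import math

import numpy as np


def log_weight(X, y, p, sigma2, log_prior):
    """Log of the unnormalized aggregation weight of pattern p, and its OLS fit."""
    idx = np.flatnonzero(p)
    beta = np.zeros(X.shape[1])
    if idx.size:
        beta[idx] = np.linalg.lstsq(X[:, idx], y, rcond=None)[0]
    err = float(np.sum((y - X @ beta) ** 2))
    return -err / (4 * sigma2) - idx.size / 2 + log_prior(p), beta


def sparsity_log_prior(M, R):
    """Log prior k * log(k / (2eM)) for k < R nonzeros (constants cancel in ratios)."""
    def log_prior(p):
        k = int(np.sum(p))
        if k == 0:
            return 0.0
        if k >= R:
            return -math.inf  # assumption: no mass on patterns of size R or more
        return k * math.log(k / (2 * math.e * M))
    return log_prior


def mh_aggregate(X, y, sigma2, log_prior, n_iter=2000, burn_in=500, seed=0):
    """Average the OLS fits visited by a Metropolis random walk over patterns."""
    rng = np.random.default_rng(seed)
    M = X.shape[1]
    p = np.zeros(M, dtype=int)  # begin with the empty model
    lw, beta = log_weight(X, y, p, sigma2, log_prior)
    total = np.zeros(M)
    for t in range(n_iter):
        q = p.copy()
        j = rng.integers(M)  # select one covariate uniformly at random
        q[j] = 1 - q[j]      # add it if absent, remove it if present
        lw_q, beta_q = log_weight(X, y, q, sigma2, log_prior)
        # Metropolis step: accept with probability min(1, weight(q) / weight(p)).
        if rng.random() < math.exp(min(0.0, lw_q - lw)):
            p, lw, beta = q, lw_q, beta_q
        if t >= burn_in:
            total += beta
    return total / (n_iter - burn_in)
```

Because only ratios of weights enter the acceptance step, the normalizing constant of the prior never needs to be computed. Swapping `sparsity_log_prior` for a different `log_prior` function changes the prior without touching the sampler.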

Two practical concerns arise from this approach. First, the algorithm assumes that $\sigma^2$ is
known. Second, the Metropolis algorithm requires significantly more computation than
competing sparse methods. Regarding the variance, Rigollet and Tsybakov (2010) proposed
a two stage scheme: the algorithm is run twice, and the residuals from the first run provide
an estimate for the variance for the second run. To the second point, a simple analysis
of the algorithm reveals that at each iteration of the MCMC method, we must fit a linear
regression model. In order to effectively explore the sparsity pattern hypercube, we must
run the Markov chain for a number of iterations on the order of $M$, the number of candidate
predictors. We can therefore expect computation times on the order of a linear regression fit
times $M$. When $M$ is of a much higher order than the number of observations, this is a concern,
and it makes the sparse estimator difficult to compute in very high dimensional settings. However,
in a structured sparse problem, we may have structural information that effectively reduces
the order of $M$, such as in group sparsity.



3 Structured, Sparse Aggregation 

The sparsity pattern aggregate estimator derives its sparsity property from placing a prior on
sparsity patterns that is inversely proportional to the $\ell_0$ norm of the pattern. This up-weights
models with sparsity patterns with low $\ell_0$ norm, encouraging sparsity. We propose basing
similar priors on set functions other than the $\ell_0$ norm. These set functions are chosen
so that the resulting estimator simultaneously encourages sparsity and structure. Thus, the
resulting estimators up-weight structured, sparse models. We consider two classes of functions:
structurally penalized norms and grouped norms.

3.1 Penalized Structured Sparsity Aggregate Estimator

Consider penalizing the $\ell_0$ norm with some non-negative set function that measures the
structure of the sparsity pattern. We will show later (see Assumption 1 in Section 4.1)
that if this set function is non-negative and does not exceed $M$, we can guarantee similar
theoretical properties as the sparsity pattern aggregate estimator. More formally, consider
the following extension:

$$p \in \mathcal{P} : \|p\|_{0,c} := \|p\|_0 + \|p\|_c \qquad (7)$$

$$\text{where: } \|p\|_c := \|\mathrm{supp}(p)\|_c : 2^{\mathcal{I}} \to [0, M] \subset \mathbb{R}, \qquad (8)$$

$$\|\emptyset\|_c := 0. \qquad (9)$$

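We then define the following prior on $\mathcal{P}$:

$$\pi_{p,c} := \begin{cases} \frac{1}{H_c}\left(\frac{\|p\|_{0,c}}{2eM}\right)^{\|p\|_{0,c}} & \|p\|_0 < R, \\ \frac{1}{2} & \|p\|_0 = M, \\ 0 & \text{else,} \end{cases} \qquad (10)$$

where $H_c$ is a normalizing constant. For our subsequent theoretical analysis, we note that
since $\|p\|_{0,c} \le 2M$, we know that $H_c \le 4$. We then define the structured sparsity
aggregate (SSA) estimator as:

$$\hat\beta^{\mathrm{SSA}} := \frac{\sum_{p \in \mathcal{P}} \hat\beta_p \exp\left(-\frac{1}{4\sigma^2}\mathrm{Error}(\hat\beta_p) - \frac{\|p\|_0}{2}\right)\pi_{p,c}}{\sum_{p \in \mathcal{P}} \exp\left(-\frac{1}{4\sigma^2}\mathrm{Error}(\hat\beta_p) - \frac{\|p\|_0}{2}\right)\pi_{p,c}}. \qquad (11)$$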



We now discuss some possible choices for the structural penalty $\|\cdot\|_c$. Note that the general
consequence of the prior is that sparsity patterns with higher values of $\|\cdot\|_c$ will be
down-weighted. At the same time, the prior still contains the $\ell_0$ norm as an essential element,
and so it enforces a trade-off between sparsity and the structure captured by the additional
term.

• Covariate Weighting. Consider the function $\|p\|_c = \sum_{i=1}^{M} c_i p_i$ such that $\sum_{i=1}^{M} c_i \le M$,
$c_i \ge 0\ \forall i$. This has the effect of weighting the covariates, discouraging those with
high weight from entering the model. These weights can be determined in a wide variety
of ways, including simple prior belief elicitation. This weighting scheme is related to
the prior suggested in Hoeting et al. (1999) in the Bayesian model averaging setting.
This strategy also has the flavor of the individual weighting in the adaptive lasso,
where Zou (2006) considered weighting each coordinate in the lasso using coefficient
estimates from OLS or marginal regression.



• Graph Structures. Generalizing previous work, Bach (2010) suggested many
structure-inducing set functions in the regularization setting. Many of these functions can be
easily adapted to this framework. For example, given a directed acyclic graph (DAG)
structure over $\mathcal{I}$, the following penalty encourages a hierarchical structure:

$$\|p\|_c = |\{\text{Ancestors of } \mathrm{supp}(p)\}|. \qquad (12)$$

If we desire strong hierarchy, we can additionally define $\pi_{p,c} := 0$ if the sparsity pattern
of $p$ does not obey the hierarchical structure implied by the DAG. Strong hierarchy
may also greatly increase the speed of the MCMC algorithm by restricting the number
of predictors potentially sampled at each step.

Alternately, suppose we have a set of weights over pairs of predictors represented by
the function $d : \mathcal{I} \times \mathcal{I} \to \mathbb{R}^+$. Given a graph over the candidate covariates, this
could correspond to edge weights, or the shortest path between two nodes (covariates).
More generally, it could correspond to a natural geometric structure such as a line or
a lattice; see Percival, Roeder, Rosenfeld and Wasserman (2011) for such an example.
We can use these weights to define the cut function:

$$\|p\|_c = \sum_{i \in \mathrm{supp}(p),\ j \notin \mathrm{supp}(p)} d(i, j). \qquad (13)$$

This encourages sparsity patterns to partition the set $\mathcal{I}$ into two maximally
disconnected sets, as defined by low values of $d(\cdot,\cdot)$. This would give sparsity patterns
corresponding to isolated neighborhoods in the graph.

• Cluster Counting. We finally propose a new $\|\cdot\|_c$ that measures the structure of
the sparsity pattern by counting the number of clusters in $p$. Suppose we now have
a symmetric weight function $d : \mathcal{I} \times \mathcal{I} \to \mathbb{R}^+$. Suppose we also set a constant $h > 0$.
Then, we count the clusters using the following procedure:

1. Define the fully connected weighted graph over the set $\mathrm{supp}(p)$ with weights given
by $d(\cdot,\cdot)$.

2. Break all edges with weight greater than $h$.

3. Return the remaining number of connected components as $\|p\|_c$.

This definition encourages sparsity patterns that are clustered with respect to $d(\cdot,\cdot)$.
For computational considerations, note that this strategy is the same as single linkage
clustering with parameter $h$, or building a minimal spanning tree and breaking all
edges with weight greater than $h$. For many geometries, this definition of $\|\cdot\|_c$ is easy
to compute and update for the MCMC algorithm.
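
The three penalties above are simple to evaluate for a given support. The following Python sketch is one possible rendering, with hypothetical function names; the cluster count uses a small union-find structure to follow the three-step procedure (connect every pair in the support, keep only edges of weight at most $h$, count components). The weight function `d` is supplied by the user and is assumed to be symmetric.

```python
def covariate_weight_penalty(support, c):
    """Sum of the covariate weights c[i] over the active set."""
    return sum(c[i] for i in support)


def cut_penalty(support, M, d):
    """Total weight d(i, j) over pairs with i inside and j outside the support."""
    inside = set(support)
    return sum(d(i, j) for i in inside for j in range(M) if j not in inside)


def cluster_count_penalty(support, d, h):
    """Number of connected components once edges with weight > h are broken."""
    nodes = list(support)
    parent = {i: i for i in nodes}

    def find(i):
        # Follow parent links to the component representative, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for a in range(len(nodes)):
        for b in range(a + 1, len(nodes)):
            if d(nodes[a], nodes[b]) <= h:  # this edge survives the threshold
                parent[find(nodes[a])] = find(nodes[b])
    return len({find(i) for i in nodes})
```

For a line geometry one could pass `d = lambda i, j: abs(i - j)`. The pairwise loop is quadratic in the support size, which is acceptable for sparse patterns; inside an MCMC run, an incremental update after each single-covariate flip would be cheaper.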

3.2 Group Structured Sparsity Aggregate Estimator 

In the framework of structured sparsity, one popular representation of structure is via groups
of variables, cf. Yuan and Lin (2006). For example, a factor covariate with $u$ levels, as in an
ANOVA model, can be represented as a collection of $u - 1$ indicator variables. We would not
select these variables individually, instead preferring to include or exclude them as a group.
In the case where these groups partition $\mathcal{I}$, this structure can be easily incorporated into the
prior, theory, and implementation of the sparsity pattern aggregate estimator. Suppose we
a priori define:

$$\mathcal{G} := \{g\} \text{ such that } g \subseteq \mathcal{I}\ \forall g,\ \bigcup_{g \in \mathcal{G}} g = \mathcal{I}, \text{ and } g \cap g' = \emptyset\ \forall g, g' \in \mathcal{G},\ g \ne g', \qquad (14)$$

$$\|\beta\|_{0,\mathcal{G}} := |\{g : g \cap \mathrm{supp}(\beta) \ne \emptyset\}|, \qquad (15)$$

$$\|\beta\|_{1,\mathcal{G}} := \sum_{g \in \mathcal{G}} \|\beta_g\|_2. \qquad (16)$$



$\|\beta\|_{1,\mathcal{G}}$ is the same as the grouped lasso penalty (Yuan and Lin, 2006), which is used to
induce sparsity at the group level in the regularization setting. $\|\beta\|_{0,\mathcal{G}}$ is simply the number
of groups needed to cover the sparsity pattern of $\beta$. Thus, we have simply replaced sparsity
patterns over all subsets of predictors with sparsity patterns over all subsets of groups of
predictors. We can show that the theoretical framework of Rigollet and Tsybakov (2010)
holds with $\mathcal{P} = \{0, 1\}^{|\mathcal{G}|}$, $\|\beta\|_0$ replaced with $\|\beta\|_{0,\mathcal{G}}$, and $\|\beta\|_1$ replaced with $\|\beta\|_{1,\mathcal{G}}$.



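A more interesting and flexible case arises when we allow the elements of $\mathcal{G}$ to overlap.
Here, we adopt the framework of Jacob et al. (2009), who gave a norm and penalty for
inducing sparsity patterns using overlapping groups in the regularization setting. In this
case, we define the groups as any collection of sets of covariates:

$$\mathcal{G} := \{g\} \text{ such that } g \subseteq \mathcal{I}\ \forall g, \text{ and } \bigcup_{g \in \mathcal{G}} g = \mathcal{I}. \qquad (17)$$

We now define the $\mathcal{G}$-decomposition as the following set of size $|\mathcal{G}|$:

$$\mathcal{V}_{\mathcal{G}}(\beta) = \{v_g : g \in \mathcal{G},\ v_g \in \mathbb{R}^M \text{ s.t. } \mathrm{supp}(v_g) \subseteq g\} \qquad (18)$$

$$\text{such that } \sum_{v_g \in \mathcal{V}_{\mathcal{G}}(\beta)} v_g = \beta. \qquad (19)$$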

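That is, $\mathcal{V}_{\mathcal{G}}(\beta)$ contains a single vector $v_g$ for each $g \in \mathcal{G}$. For arbitrary $\mathcal{G}$ and $\beta$, $\mathcal{V}_{\mathcal{G}}(\beta)$ is not unique.
We then define the following functions, analogous to the $\ell_0$ and $\ell_1$ norms of the usual sparsity
framework:

$$\|\beta\|_{0,\mathcal{G}} := \min_{G \subseteq \mathcal{G};\ \bigcup_{g \in G} g = \mathrm{supp}(\beta)} |G|, \qquad (20)$$

$$\|\beta\|_{1,\mathcal{G}} := \min_{\mathcal{V}_{\mathcal{G}}(\beta)} \sum_{v_g \in \mathcal{V}_{\mathcal{G}}(\beta)} \|v_g\|_2. \qquad (21)$$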



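In $\|\cdot\|_{1,\mathcal{G}}$, the minimum is over all possible decompositions $\mathcal{V}_{\mathcal{G}}(\cdot)$. Computing $\|\cdot\|_{0,\mathcal{G}}$ is difficult
for arbitrary $\mathcal{G}$. However, in most applications $\mathcal{G}$ has some regular structure which allows
for efficient computation. The norm in Equation 20 leads to the following choice of prior on
$\mathcal{P}$:

$$\pi_{p,\mathcal{G}} := \begin{cases} \frac{1}{H_{\mathcal{G}}}\left(\frac{\|p\|_{0,\mathcal{G}}}{2e|\mathcal{G}|}\right)^{\|p\|_{0,\mathcal{G}}} & \|p\|_0 < R, \\ \frac{1}{2} & \|p\|_0 = M, \\ 0 & \text{else.} \end{cases} \qquad (22)$$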



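By considering all unions of groups, we obtain an upper bound for the normalizing constant:
$H_{\mathcal{G}} \le 4$. We then define the grouped sparsity aggregate (GSA) estimator as:

$$\hat\beta^{\mathrm{GSA}} := \frac{\sum_{p \in \mathcal{P}} \hat\beta_p \exp\left(-\frac{1}{4\sigma^2}\mathrm{Error}(\hat\beta_p) - \frac{\|p\|_0}{2}\right)\pi_{p,\mathcal{G}}}{\sum_{p \in \mathcal{P}} \exp\left(-\frac{1}{4\sigma^2}\mathrm{Error}(\hat\beta_p) - \frac{\|p\|_0}{2}\right)\pi_{p,\mathcal{G}}}. \qquad (23)$$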

We leave $\mathcal{G}$ general throughout this section and the subsequent theoretical analysis. There
are many possible definitions of $\mathcal{G}$, such as connected components or neighborhoods in a
graph, groups of factor predictors, or application driven groups; see Jacob et al. (2009)
for some examples. In particular, many of the structures mentioned in Section 3.1 can be
encoded as a series of groups.
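
For small collections of groups, the grouped $\ell_0$ quantity can be evaluated by brute force. The Python sketch below is an illustration only: it assumes the constraint in Equation 20 requires the union of the chosen groups to equal the support exactly, returns `None` when no such union exists, and uses the hypothetical name `group_l0`. Its cost grows exponentially in the number of usable groups, which is consistent with the remark that the computation is difficult for arbitrary collections.

```python
from itertools import combinations


def group_l0(support, groups):
    """Smallest number of groups whose union equals `support` (None if impossible)."""
    support = frozenset(support)
    if not support:
        return 0
    # Only groups contained in the support can take part in an exact cover.
    usable = [frozenset(g) for g in groups if frozenset(g) <= support]
    for size in range(1, len(usable) + 1):
        for subset in combinations(usable, size):
            if frozenset().union(*subset) == support:
                return size
    return None
```

When the groups have regular structure, such as intervals on a line, a greedy or dynamic programming routine would replace the exhaustive search.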

4 Theoretical Properties 



Rigollet and Tsybakov (2010) showed that the sparsity pattern aggregate estimator enjoys
strong theoretical properties. In summary, they showed that the estimator adapts to the
sparsity of the target, measured in both the $\ell_0$ and $\ell_1$ norm. Further, they showed that their
sparsity oracle inequalities are optimal in a minimax sense, and in particular superior to rates
obtained for popular estimators such as the lasso. Moreover, their results require fewer
assumptions than those of the lasso, cf. Bickel et al. (2009). In the supplementary material
we give a theoretical framework for aggregation using priors of our form (Equations 10
and 22). The following shows specific applications of this theory, yielding a set of structured
sparse oracle inequalities, the first of their kind.

4.1 Structurally Penalized $\ell_0$ Norm

We first state an assumption: 

Assumption 1. For all $p$ where $R > \|p\|_0 > 0$:

$$\frac{\|p\|_0}{\|p\|_{0,c}} \le \log\left(1 + \frac{eM}{\max(\|p\|_{0,c}, 1)}\right). \qquad (24)$$

Numerical analysis reveals that a sufficient condition for this assumption is $0 \le \|p\|_c \le M$.
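
A numerical check of this kind can be sketched in a few lines of Python. The sketch assumes that the inequality in Assumption 1 reads $\|p\|_0/\|p\|_{0,c} \le \log(1 + eM/\max(\|p\|_{0,c}, 1))$ with $\|p\|_{0,c} = \|p\|_0 + \|p\|_c$; the function names and the grid resolution are hypothetical choices made for illustration.

```python
import math


def assumption1_holds(k, c, M):
    """Check k / (k + c) <= log(1 + e * M / max(k + c, 1)) for one pattern.

    k is the number of nonzeros of the pattern and c its structural penalty.
    """
    total = k + c
    return k / total <= math.log(1 + math.e * M / max(total, 1))


def check_sufficient_condition(M, grid=50):
    """Scan all 0 < k <= M and a grid of penalty values 0 <= c <= M."""
    return all(
        assumption1_holds(k, M * t / grid, M)
        for k in range(1, M + 1)
        for t in range(grid + 1)
    )
```

Calling `check_sufficient_condition(M)` for several values of `M` scans the whole range of penalties allowed by the sufficient condition; a grid scan of this kind is evidence for the claim, not a proof.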

Proposition 1. Suppose Assumption 1 holds. For any $M \ge 1$, $n \ge 1$, the structured sparsity
aggregate estimator satisfies:

$$E\|X\hat\beta^{\mathrm{SSA}} - y\|_2^2 \le \min_{\beta \in \mathbb{R}^M} \left\{ \|X\beta - y\|_2^2 + \min\left\{ \frac{\sigma^2 R}{n},\ \frac{9\sigma^2 M_c(\beta)}{n} \log\left(1 + \frac{eM}{\max(M_c(\beta), 1)}\right) \right\} \right\}. \qquad (25)$$

Here, $R = \mathrm{rank}(X)$, and $M_c(\beta) = \|\mathrm{sparsity}(\beta)\|_{0,c}$, where $\mathrm{sparsity}(\beta)$ is the sparsity pattern
of $\beta$.

A key property of the next proposition is the existence of some $\gamma \ge 1$ such that $\forall p \in \mathcal{P}$:
$\|p\|_0 \le \|p\|_{0,c} \le \gamma\|p\|_0$.

Proposition 2. Suppose Assumption 1 holds. Suppose the structural penalty in the structured
sparsity aggregate (SSA) estimator satisfies $\forall p \in \mathcal{P}$: $\|p\|_0 \le \|p\|_{0,c} \le \gamma\|p\|_0$ for some $\gamma \ge 1$.
Then for any $M \ge 1$, $n \ge 1$, the SSA estimator satisfies:

$$E\|X\hat\beta^{\mathrm{SSA}} - y\|_2^2 \le \min_{\beta \in \mathbb{R}^M} \left\{ \|X\beta - y\|_2^2 + \phi_{n,M}(\beta) \right\} + \frac{\sigma^2}{n}\left(9\log(1 + eM) + 8\log 2\right) \qquad (26)$$

where $\phi_{n,M}(0) := 0$ and for $\beta \ne 0$:



bnM = mm 



n' n \ max(Mc(/3), 1)/ ' ^/n y \ \\f3\\iy/yri J 

(27) 



4.2 Grouped $\ell_0$ Norm

We first state an assumption:

Assumption 2. For all $p \in \mathcal{P}$ where $R > \|p\|_0 > 0$:

$$\frac{\|p\|_0}{\|p\|_{0,\mathcal{G}}} \le \log\left(1 + \frac{e|\mathcal{G}|}{\max(\|p\|_{0,\mathcal{G}}, 1)}\right). \qquad (28)$$

This assumption does not hold uniformly for all sparsity patterns and for all choices of
$\mathcal{G}$. A sufficient condition for the assumption is:

$$\max_{g \in \mathcal{G}} |g| \le \log(1 + e|\mathcal{G}|/R). \qquad (29)$$

In particular, for sparsity patterns with low $\ell_0$ norm relative to $M$, the assumption is satisfied
provided the cardinality of $\mathcal{G}$ is large enough.

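Proposition 3. Suppose Assumption 2 holds. For any $M \ge 1$, $n \ge 1$, the grouped sparsity
aggregate estimator satisfies:

$$E\|X\hat\beta^{\mathrm{GSA}} - y\|_2^2 \le \min_{\beta \in \mathbb{R}^M} \left\{ \|X\beta - y\|_2^2 + \min\left\{ \frac{\sigma^2 R}{n},\ \frac{9\sigma^2 M_{\mathcal{G}}(\beta)}{n} \log\left(1 + \frac{e|\mathcal{G}|}{\max(M_{\mathcal{G}}(\beta), 1)}\right) \right\} \right\}. \qquad (30)$$

Here, $R = \mathrm{rank}(X)$, and $M_{\mathcal{G}}(\beta) = \|\mathrm{sparsity}(\beta)\|_{0,\mathcal{G}}$, where $\mathrm{sparsity}(\beta)$ is the sparsity pattern
of $\beta$.

Proposition 4. Suppose Assumption 2 holds. Then for any $M \ge 1$, $n \ge 1$, the grouped
sparsity aggregate estimator satisfies:

$$E\|X\hat\beta^{\mathrm{GSA}} - y\|_2^2 \le \min_{\beta \in \mathbb{R}^M} \left\{ \|X\beta - y\|_2^2 + \phi_{n,\mathcal{G}}(\beta) \right\} + \frac{\sigma^2}{n}\left(9\log(1 + e|\mathcal{G}|) + 8\log 2\right) \qquad (31)$$

where $\phi_{n,\mathcal{G}}(0) := 0$ and for $\beta \ne 0$: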



= mm 



9a^Mg{(3) 



n 



n 



log 1 + 



e\G\ 



max(Mg(/3),l) 



n 



(32) 



4.3 Discussion of the Results 

For each class of prior, we give two main results. The first result shows that each procedure
enjoys adaptation in terms of the appropriate structured sparsity measuring set function,
$\|\cdot\|_{0,c}$ and $\|\cdot\|_{0,\mathcal{G}}$, respectively. The bound is thus best when the structured sparsity of
the regression function is small, as measured by the appropriate set functions; the estimator
adapts to the structured sparsity of the target. The second demonstrates that the estimators
also adapt to structured sparsity measured in terms of a corresponding convex norm, $\|\cdot\|_1$
and $\|\cdot\|_{1,\mathcal{G}}$. This is useful when some entries of $\beta$ contribute little to the convex norm, but
still incur a penalty in the corresponding set function. For example, a small isolated entry
of $\beta$ contributes little to the $\ell_1$ norm, but is heavily weighted in the structurally penalized
$\ell_0$ norm.



Compared with the corresponding results in Rigollet and Tsybakov (2010), these
results reveal some benefits and drawbacks to adding structure to the sparse aggregation
procedure. In the penalized case, the results show that the structured estimator enjoys the
same rates as the sparse estimator when the penalty is low. When structure is not present
in the target, the sparse estimator is superior, as expected. Proposition 2 is still given in
terms of the $\ell_1$ norm, which only measures sparsity. The price for adding structure to the
procedure appears in the additional factor of $\sqrt{\gamma}$. While these results are not dramatic, the
previous discussion (Section 3.1) and subsequent simulation study (Section 5.1) show that
the penalized version is flexible and powerful in practice.

In the grouped case, the results are more appealing. Since the grouped $\ell_0$ and $\ell_1$ norms
are potentially much smaller than their ungrouped counterparts, the results here give better
constants than their sparse versions. These improvements may be dramatic: previous work
on the grouped lasso, cf. Lounici et al. (2009), Huang and Zhang (2010), revealed great
benefits to grouped structures. Following the settings of Lounici et al. (2009), consider a
multi-task regression setting in which we desire the same sparsity pattern across tasks. Then,
if the number of tasks is of the same order as, or of higher order than, the number of samples
per task ($n$), a grouped aggregation approach would reduce the order (in $n$) of the rates in
the theoretical results. We can also expect such improvements for an overlapping set of
groups that do not highly overlap.

The propositions given in the previous subsections are simplified versions of those proved
for the sparsity pattern aggregate estimator in Rigollet and Tsybakov (2010). We note
that the full results can be extended to our estimators; we omit the derivation for brevity.
In addition to these more complex statements, Rigollet and Tsybakov (2010) also gave a
detailed theoretical discussion of these results in comparison to the lasso and BIC aggregation
estimators (Bunea et al., 2007), concluding that their estimator enjoyed superior and near
optimal rates. Since our rates differ by no more than constants when the target is truly
structured and sparse, we conclude that in such settings a structured approach can give
great benefits.



5 Applications 
5.1 Simulation Study 



We now turn to a simulation study. Rigollet and Tsybakov (2010) presented a detailed
simulation study comparing their sparsity pattern aggregate estimator (see Section 2) to
numerous sparse regression methods. They demonstrated that the sparsity pattern aggregate
was superior to the competitor methods. Therefore, we primarily compare our technique to
the sparsity pattern aggregate estimator. We will show that the structured sparsity pattern
aggregate estimator is superior under appropriate settings where the target is structured.
For brevity, we consider only the structurally penalized $\ell_0$ norm. In the following, we
employ our cluster counting penalty, described in Section 3.1, with $h = 3$. We consider two
settings that offer natural geometries and notions of structure: connected components in a
line structure (see, e.g., the top left display in Figure 1), and blocks in a two-dimensional
lattice (see, e.g., the top left display in Figure 2). Using these natural geometries, we let
$d(\cdot,\cdot)$ be Euclidean distance. We uniformly at random set the appropriate entries of a true
coefficient vector $\beta$ to be one of $\{+1, -1\}$. The entries of the $n \times M$ design matrix $X$ are
independent standard normal random variables. We additionally generate an $n \times M$ matrix
$X_{\mathrm{test}}$ to measure prediction performance; see below. We consider different values of $n$, the
number of data points; $M$, the number of candidate covariates, represented as columns in
$X$; $C$, the number of clusters as measured by the cluster counting penalty applied to the
true sparsity pattern; and $C_{\mathrm{on}}$, the number of nonzero entries per cluster in $\beta$. We enforce
non-overlapping clusters, giving $\|\beta\|_0 = C \times C_{\mathrm{on}}$. For direct comparison we follow Rigollet
and Tsybakov (2010), and set the noise level $\sigma = \|\beta\|_0/9$, and run the MCMC algorithm for
7000 iterations, discarding the first 3000. We repeat each simulation setting 250 times.
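We use two metrics to measure performance. First, prediction risk:

$$\mathrm{Prediction}(\hat\beta) := \frac{\|X_{\mathrm{test}}(\beta - \hat\beta)\|_2^2}{n}. \qquad (33)$$

Our second metric measures the estimation of $\beta$:

$$\mathrm{Recovery}(\hat\beta) := \frac{\|\hat\beta - \beta\|_2^2}{\|\beta\|_2^2}. \qquad (34)$$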
In each of the above, $\hat\beta$ denotes some estimate of $\beta$. We compare our structured
sparsity aggregate estimator (SSA) against the sparsity pattern aggregate estimator (SPA)
and the lasso (lasso). Note that the true coefficients, while clustered, are not smooth,
making these settings inappropriate applications for structured smooth estimators such as
the 1d or 2d fused lasso (Tibshirani, Saunders, Rosset, Zhu and Knight, 2005). For the lasso
we choose the tuning parameter $\lambda$ using 10-fold cross validation, and refit the model using
ordinary least squares regression, both within and outside cross validation. This strategy
effectively uses the lasso only for its variable selection properties and avoids shrinkage in $\hat\beta$.
We employ the R package glmnet (Friedman et al., 2008) to fit the lasso.
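
The line-structure design described above can be sketched in Python as follows. This is an illustration under assumptions, not the code used for the reported results: the text does not specify how cluster locations are chosen, so the sketch places clusters at evenly spaced positions, and it assumes the prediction metric is $\|X_{\mathrm{test}}(\beta - \hat\beta)\|_2^2/n$. The function names are hypothetical and NumPy is assumed to be available.

```python
import numpy as np


def simulate_line_setting(n, M, C, C_on, seed=0):
    """Generate one data set with C clusters of C_on nonzero entries on a line."""
    rng = np.random.default_rng(seed)
    block = M // C
    if block < C_on + 3:
        # Neighbouring clusters must be more than h = 3 apart to stay separate.
        raise ValueError("clusters would merge under the h = 3 cluster counting rule")
    beta = np.zeros(M)
    for k in range(C):
        start = k * block  # evenly spaced placement: our own simplification
        beta[start:start + C_on] = rng.choice([-1.0, 1.0], size=C_on)
    sigma = np.count_nonzero(beta) / 9.0  # noise level ||beta||_0 / 9
    X = rng.standard_normal((n, M))
    X_test = rng.standard_normal((n, M))
    y = X @ beta + sigma * rng.standard_normal(n)
    return X, y, X_test, beta, sigma


def prediction_risk(X_test, beta, beta_hat):
    """Squared prediction error on the test design, divided by the sample size."""
    return float(np.sum((X_test @ (beta - beta_hat)) ** 2)) / X_test.shape[0]
```

Fixing `seed` makes paired comparisons possible: each estimator can be run on the same simulated data set, as in the paired results reported below.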



Tables 1 and 2 display the results. In all cases, the structured sparse estimator is superior
to the sparse estimator, and both methods are superior to the lasso. Although the mean
prediction and recovery for the aggregation estimators are within two standard errors of
each other, for paired runs on the same simulated data set, the structured sparse estimator
is superior in both metrics at least 95% of the time, for all settings. Figures 1 and 2 display
results for a sample sparsity pattern in both settings. We can clearly see the superiority
of the aggregation methods over the lasso. In both figures, we see that both aggregation
methods correctly estimated the true sparsity pattern. However, in the sparse estimator, the
Markov chain spent many iterations adding and dropping covariates far away from the true
clusters. This did not happen in the structured estimators, giving a much sharper picture
of the sparsity pattern in both cases. Rejecting these wandering steps gave the structured
estimator better numerical performance in both prediction and estimation.

5.2 Application to HIV Drug Resistance 

We now explore a data application which calls for a structured sparse approach. Standard 
drug therapy for Human Immunodeficiency Virus (HIV) inhibits the activity of proteins 
produced by the virus. HIV is able to change its protein structure easily and become resistant 
to the drugs. The goal is then to determine which mutations drive this resistance. We use 
regression to determine the relationship between a particular strain of HIV's resistance to a
drug and its protein sequence. Rhee et al. (2006) studied this problem using sparse regression
techniques.

Casting this problem as linear regression, the continuous response is drug resistance,
measured by the log dosage of the drug needed to effectively negate the virus' reproduction.
The covariates derive from the protein sequences. Each sequence is 99 amino acids long, so
we view each of these 99 positions as a factor. Breaking each of these factors into levels, we
obtain mutation covariates, which form our set of candidate predictors. If a location displays
$A$ different amino acids across the data, we obtain $A - 1$ mutation covariates. Thus, each
covariate is an indicator variable for the occurrence of a particular amino acid at a particular
location in the protein sequence. Note that many positions in the protein sequence display
no variation throughout the data set (these positions always display the same amino acid)
and are therefore dropped from the analysis. In summary, the predictors are mutations in
the sequence, and the response is the log dosage. A sparse model would show exactly which
mutations are most important in driving resistance. We are interested in which mutations
predict drug resistance, rather than only which locations predict drug resistance. Therefore,
we do not select the mutation covariates from a location jointly. We instead treat each
mutation separately.
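The encoding described above can be sketched as follows. This is a minimal illustration, not the preprocessing used for the actual data: the function name `mutation_covariates` is hypothetical, and taking the most common amino acid at each position as the baseline level is an assumption (the text only states that a position with $A$ observed amino acids yields $A - 1$ indicators).

```python
def mutation_covariates(sequences):
    """Build 0/1 mutation indicators from equal-length protein sequences.

    Returns (names, rows), where names is a list of (position, amino_acid)
    pairs (positions are 1-based) and rows holds one indicator row per
    sequence. Positions with no variation are dropped.
    """
    length = len(sequences[0])
    names = []
    for pos in range(length):
        observed = sorted({seq[pos] for seq in sequences})
        if len(observed) < 2:
            continue  # invariant position: contributes no covariates
        counts = {a: sum(seq[pos] == a for seq in sequences) for a in observed}
        # Assumption: the most common amino acid is the baseline level.
        reference = max(observed, key=lambda a: counts[a])
        names.extend((pos + 1, a) for a in observed if a != reference)
    rows = [[1 if seq[pos - 1] == a else 0 for pos, a in names]
            for seq in sequences]
    return names, rows

# Toy example with sequences of length 3 (the real sequences have length 99).
names, rows = mutation_covariates(["PQI", "PQV", "PKV", "PQV"])
```

In the toy example the first position never varies and is dropped, while the second and third positions each contribute one mutation covariate.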

Additional biological information gives us reason to believe a structured, sparse model is 
more appropriate. Proteins typically function by active sites. That is, localized areas of the 
protein are more important to the protein function than others. Viewing the sequence as 
a simple linear structure, we expect that selected mutations should occur clustered in this 
structure. We can cluster the mutations by defining a distance in a straightforward way: since
each mutation covariate is also associated with a location, we can define $d(\cdot, \cdot)$, the distance
between a pair of mutation covariates, as the absolute difference in their locations.

We apply our structured sparse aggregation (SSA) method along with sparse aggregation
(SPA), forward stepwise regression, and the lasso to the data for the drug Saquinavir (SQV).
See Rhee, Gonzales, Kantor, Betts, Ravela and Shafer (2003) for details on the data
and Percival et al. (2011) for another structured sparse approach to the analysis; the data
are available as a data set in the R package BLINDED (Percival, 2011). We set $h = 3$ in our
cluster counting structural penalty for the structured aggregation method.
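As a rough illustration of how a location-based distance and a threshold $h$ can be turned into a cluster count, consider the sketch below. It assumes that two selected mutations belong to the same cluster whenever they can be linked by a chain of selected mutations whose consecutive locations differ by at most $h$; this linking rule and the function names are assumptions made for illustration, and the precise penalty is the one defined in the main text.

```python
def distance(loc_i, loc_j):
    """Distance between two mutation covariates: absolute difference in location."""
    return abs(loc_i - loc_j)

def count_clusters(selected_locations, h):
    """Count clusters among selected mutations on a linear sequence.

    Assumption: consecutive selected locations (after sorting) that are
    within distance h of each other fall in the same cluster.
    """
    locs = sorted(selected_locations)
    if not locs:
        return 0
    clusters = 1
    for prev, cur in zip(locs, locs[1:]):
        if distance(prev, cur) > h:
            clusters += 1  # gap larger than h starts a new cluster
    return clusters

# Selected mutations at these protein locations form three clusters for h = 3.
n_clusters = count_clusters([10, 11, 13, 46, 48, 90], h=3)
```

Under this reading, a pattern concentrated in a few active regions receives a small count even when it contains many mutations.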

We display a comparison of the sparsity patterns for the methods in Figure 3. We see
that each method selects similar mutations. As expected, the structured sparse estimator
encourages clustered selection of mutations, giving us two clear important regions. In
contrast, the sparse aggregation estimator, stepwise regression, and the lasso suggest mutations
across the protein sequence.

We finally evaluate the predictive performance of the four methods using data splitting. 
We split the data into three equal groups, and compare the mean test error from using each 
set of two groups as a training set, and the third as a test set. Table 3 shows that both
aggregation estimators are superior to the lasso and stepwise regression. Although the mean 
test error is lower for the sparse aggregation estimator, it is within a single standard deviation 
of the structured estimator's mean test error. Therefore, the structured estimator gives 
comparable predictive power, with the extra benefit of superior biological interpretability. 
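The data splitting scheme can be sketched generically as below. The helper names are hypothetical, the per-fold error is taken to be the mean squared test error (an assumption about the normalization), and `predict_from_train` stands in for any of the four fitting procedures.

```python
import random

def three_fold_splits(n, seed=0):
    """Yield (train, test) index lists for three (near) equal groups."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::3] for i in range(3)]
    for k in range(3):
        test = folds[k]
        train = [i for j in range(3) if j != k for i in folds[j]]
        yield train, test

def mean_test_error(y, predict_from_train):
    """Average squared test error over the three splits.

    predict_from_train(train) must return a function mapping an
    observation index to a prediction (hypothetical interface).
    """
    errors = []
    for train, test in three_fold_splits(len(y)):
        predict = predict_from_train(train)
        errors.append(sum((y[i] - predict(i)) ** 2 for i in test) / len(test))
    return sum(errors) / len(errors), errors
```

Each observation appears in exactly one test set, so the three fold errors can be averaged and their spread reported alongside the mean.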



6 Conclusion 

In this paper, we proposed simple modifications of a powerful sparse aggregation technique, 
giving a framework for structured sparse aggregation. We presented methods for two main 
classes of structured sparsity: set function based structure and group based structure.
These aggregation estimators place the highest weight on models whose sparsity patterns are
the most sparse and structured. We showed that these estimators enjoy appropriate oracle 
inequalities — they adapt to the structured sparsity of the targets. Further, we showed that 
in practice these methods are effective in the appropriate setting. 

In the theory throughout this paper, we considered a particular structure in the prior 
in order to easily compare theoretical properties with sparse estimators. In practice, the 
form of the prior may be modified further. For example, we need not restrict our structural 
penalty to be less than the number of predictors. In our current formulation, this restriction 
forced us to consider sparsity and structure with equal weight.

Although both the sparsity pattern and structured sparsity pattern estimators display
good promise theoretically and in practice, there are several practical challenges remaining.
First, while Rigollet and Tsybakov (2010) suggested a strategy for dealing with the
assumption that $\sigma^2$ is known, it requires running another Markov chain to find a good estimate for
$\sigma^2$. This strategy is slow, and the stochastic greedy algorithm is much slower than
comparable sparse techniques. While the algorithm is not prohibitively slow, speedups would greatly
enhance its utility. Currently, the algorithm must be run for at least approximately $10 \times M$
iterations so that it has time to search over all $M$ covariates. Since each iteration requires an
OLS regression fit, if $M$ is of the same or greater order than $n$, this is a significant drawback.
Thus, the estimator does not scale well to high dimensions. In future work, we can also
consider a specialized version of the stochastic greedy algorithms adapted to our structured
priors.



A Implementation of Aggregation Estimators 
A.1 Metropolis Algorithm

Here, we give the implementation of the sparsity pattern aggregation estimator, proposed
by Rigollet and Tsybakov (2010). This approach can be naturally adapted to the structured
case. For numerical implementation, Rigollet and Tsybakov (2010) consider the following
simplified prior:

$$\pi_p = \begin{cases} \frac{1}{H}\left(\frac{\|p\|_0}{2eM}\right)^{\|p\|_0} & \|p\|_0 < R \\ 0 & \text{else} \end{cases} \tag{35}$$
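To get a feel for the simplified prior, the sketch below evaluates its normalizing constant numerically. It assumes the prior weight of a pattern with $k$ active covariates is proportional to $(k/(2eM))^k$ (with the convention $0^0 = 1$) for $k < R$ and zero otherwise; the function names are hypothetical. Since there are at most $(eM/k)^k$ patterns with $k$ active covariates, each sparsity level contributes at most $2^{-k}$, so the constant lies between 1 and 2.

```python
import math

def log_prior_weight(k, M):
    """Log of the unnormalized prior weight (k / (2 e M))**k, with 0**0 = 1."""
    if k == 0:
        return 0.0
    return k * math.log(k / (2.0 * math.e * M))

def normalizing_constant(M, R):
    """Sum the unnormalized prior over all patterns with fewer than R active covariates."""
    total = 0.0
    for k in range(min(R, M + 1)):
        # Log of the number of patterns with exactly k active covariates.
        log_count = math.lgamma(M + 1) - math.lgamma(k + 1) - math.lgamma(M - k + 1)
        total += math.exp(log_count + log_prior_weight(k, M))
    return total

H = normalizing_constant(M=100, R=50)
```

Working on the log scale avoids overflow in the binomial coefficients when $M$ is large.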

Initialize the algorithm by setting $p(1) = 0 \in \mathcal{P}$. Repeat the following steps for
$t = 1, \ldots, T$:

1. Generate a random integer $i$ in the set $\{1, 2, \ldots, M\}$ from a discrete uniform
distribution. Set the proposal sparsity pattern $q(t)$ as $p(t)$ with entries satisfying:

$$q(t)_j = \begin{cases} p(t)_j & i \neq j \\ 1 - p(t)_j & i = j \end{cases} \tag{36}$$

That is, entry $i$ has been toggled from "on" to "off", or vice versa.

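2. Compute $\hat\beta_{p(t)}$ and $\hat\beta_{q(t)}$, the least squares estimators under sparsity patterns $p(t)$ and
$q(t)$, respectively. Let:

$$r(t) = \min\left(\frac{\nu_{q(t)}}{\nu_{p(t)}}, 1\right) \tag{37}$$

where, writing $x_i$ for the $i$th row of $X$,

$$\frac{\nu_{q(t)}}{\nu_{p(t)}} = \exp\left(\frac{1}{4\sigma^2}\left(\sum_{i=1}^n (y_i - x_i\hat\beta_{p(t)})^2 - \sum_{i=1}^n (y_i - x_i\hat\beta_{q(t)})^2\right) + \frac{\|p(t)\|_0 - \|q(t)\|_0}{2}\right)\frac{\pi_{q(t)}}{\pi_{p(t)}}. \tag{38}$$

Here, for the prior in Equation 35:

$$\frac{\pi_{q(t)}}{\pi_{p(t)}} = \left(1 + \frac{\|q(t)\|_0 - \|p(t)\|_0}{\|p(t)\|_0}\right)^{\|q(t)\|_0}\left(\frac{\|p(t)\|_0}{2eM}\right)^{\|q(t)\|_0 - \|p(t)\|_0} \tag{39}$$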



3. Update $p(t)$ by generating the following random variable:

$$p(t+1) = \begin{cases} q(t) & \text{with probability } r(t) \\ p(t) & \text{with probability } 1 - r(t) \end{cases} \tag{40}$$

4. If $t < T$, return to step 1 and increment $t$. Otherwise, stop.
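One iteration of the steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes exponential weights proportional to $\exp(-\mathrm{RSS}_p/(4\sigma^2) - \|p\|_0/2)\,\pi_p$ with the unnormalized prior $(\|p\|_0/(2eM))^{\|p\|_0}$, it omits the rank cut-off of the prior, and `rss` is a hypothetical user-supplied function returning the residual sum of squares of the least squares fit restricted to a pattern.

```python
import math
import random

def log_prior_unnormalized(k, M):
    """Log of (k / (2 e M))**k, with the convention 0**0 = 1."""
    return 0.0 if k == 0 else k * math.log(k / (2.0 * math.e * M))

def acceptance_probability(rss_p, rss_q, k_p, k_q, M, sigma2):
    """Acceptance probability min(nu_q / nu_p, 1), computed on the log scale."""
    log_ratio = ((rss_p - rss_q) / (4.0 * sigma2)
                 + (k_p - k_q) / 2.0
                 + log_prior_unnormalized(k_q, M)
                 - log_prior_unnormalized(k_p, M))
    return 1.0 if log_ratio >= 0 else math.exp(log_ratio)

def propose(p, rng):
    """Toggle one uniformly chosen entry of the 0/1 pattern p."""
    i = rng.randrange(len(p))
    q = list(p)
    q[i] = 1 - q[i]
    return q

def metropolis_step(p, rss, M, sigma2, rng):
    """Return the next pattern: the proposal with probability r, else p."""
    q = propose(p, rng)
    r = acceptance_probability(rss(p), rss(q), sum(p), sum(q), M, sigma2)
    return q if rng.random() < r else p
```

Computing the ratio on the log scale avoids underflow when the two residual sums of squares differ by a large amount. Changing the prior only requires replacing `log_prior_unnormalized`.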



After running the above algorithm, Rigollet and Tsybakov (2010) then approximate the
sparsity pattern aggregate as:

$$\hat\beta^{SPA} \approx \frac{1}{T - T_0}\sum_{t = T_0}^{T} \hat\beta_{p(t)} \tag{41}$$
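The averaging step amounts to discarding an initial stretch of the chain and averaging the least squares coefficient vectors of the remaining draws. A minimal sketch, with a hypothetical function name and a toy chain, is:

```python
def aggregate_after_burn_in(coefficient_chain, t0):
    """Average the coefficient vectors visited after the first t0 draws.

    coefficient_chain[t] is the least squares coefficient vector (padded
    with zeros outside the pattern) for the pattern visited at step t.
    """
    kept = coefficient_chain[t0:]
    m = len(kept[0])
    return [sum(beta[j] for beta in kept) / len(kept) for j in range(m)]

# Toy chain of three draws; the first draw is treated as burn-in.
beta_hat = aggregate_after_burn_in([[9.0, 9.0], [1.0, 0.0], [3.0, 0.0]], t0=1)
```

Because coordinates outside a visited pattern are zero, covariates that are rarely visited receive averaged coefficients close to zero.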



Here, $T_0$ is an arbitrary integer, used to allow for convergence of the Markov chain. Note
that the above algorithm can be applied to any prior for the class of aggregation estimators
considered in this paper; we need only update Equation 39.

In the above algorithm, $\sigma^2$ was assumed known. In general applications, $\sigma^2$ is unknown.
Rigollet and Tsybakov (2010) gave the following strategy for dealing with this case. Denote
$\hat\beta^{SPA}_s$ as the sparsity pattern estimator computed with $\sigma^2 = s^2$. Then, we estimate $\sigma^2$ as:

$$\hat\sigma^2 = \inf\left\{ s^2 : \left| \frac{\|y - X\hat\beta^{SPA}_s\|_2^2}{n - M_n(\hat\beta^{SPA}_s)} - s^2 \right| > \alpha \right\} \tag{42}$$

where $\alpha > 0$ is a tolerance parameter, and $M_n(\beta) = \sum_{j=1}^M \mathbb{1}(|\beta_j| > 1/n)$. Again, this strategy
needs no modification if the prior is changed.

Note that while the sparse aggregation estimator up-weights sparse models via the
prior, it does not exclude any models. The exact estimator is therefore not sparse. However,
this computational strategy nearly always results in a sparse estimate. This is because
the Markov chain simply does not visit any models that are not sparse. Similarly, while
the structured sparse priors we introduce do not eliminate unstructured models from
the exact aggregate estimators, the computed estimators are almost always structured sparse.
Alternatively, we could run the Markov chain for a very long time, and obtain a model that
includes all covariates. However, we would see that many covariates appear very seldom in
the chain, and we could thus obtain a sparse or structured sparse solution with a simple
thresholding strategy.
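The thresholding strategy mentioned above can be sketched as follows: compute how often each covariate is "on" along the chain, then keep those above a cut-off. The function names and the cut-off value are illustrative assumptions; the text does not specify a particular threshold.

```python
def inclusion_frequencies(pattern_chain):
    """Fraction of chain iterations in which each covariate is selected."""
    T = len(pattern_chain)
    M = len(pattern_chain[0])
    return [sum(p[j] for p in pattern_chain) / T for j in range(M)]

def threshold_pattern(frequencies, cutoff):
    """Keep covariates whose selection frequency is at least the cut-off."""
    return [1 if f >= cutoff else 0 for f in frequencies]

chain = [[1, 0, 0], [1, 1, 0], [1, 0, 0], [1, 0, 0]]
freqs = inclusion_frequencies(chain)
pattern = threshold_pattern(freqs, cutoff=0.5)
```

These selection frequencies are also the quantities shaded from black (0%) to white (100%) in the sparsity pattern figures.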
A.2 Structural Modifications of the Algorithm

In structured sparse aggregation, we can take advantage of the allowed sparsity patterns in
the prior to streamline the Metropolis algorithm. For grouped sparsity, we consider
the hypercube of groups instead of the hypercube of all predictors. That is, given a set
of groups $\mathcal{G}$, we consider patterns represented by $\{0, 1\}^{|\mathcal{G}|}$. Effectively, we consider
adding and removing groups as a whole, rather than individual coordinates. In the case
of strong hierarchical sparsity, we can exclude any neighboring patterns that do not satisfy
strong hierarchy. That is, given a DAG, we only consider adding direct descendants of the
current sparsity pattern, or removing leaf nodes with respect to the current sparsity pattern.



B Proof and Theoretical Framework 

In the following sections, we give a general theoretical recipe, leading to the results in Section 
4 in the main text. In Section B.3, we give two Lemmas for our specific applications. 



B.1 Priors and Set Function Bounds



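Lemma 1. (From Rigollet and Tsybakov (2010)) Fix $p \in \{0,1\}^M$, and assume that $\xi_i$ are iid
random variables such that $E\xi_i = 0$ and $E\xi_i^2 = \sigma^2$, for $i = 1, \ldots, n$. Then for the least squares
estimator

$$\hat\beta_p \in \operatorname*{argmin}_{\beta:\ \mathrm{supp}(\beta) \subseteq \mathrm{supp}(p)} \|y - X\beta\|_2^2, \tag{43}$$

we have:

$$E\|X\hat\beta_p - y\|_2^2 \le \min_{\beta:\ \mathrm{supp}(\beta) \subseteq \mathrm{supp}(p)} \|X\beta - y\|_2^2 + \sigma^2\,\frac{\min(\|p\|_0, R)}{n}, \tag{44}$$

where $R = \mathrm{rank}(X)$.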

Now, suppose that we have a set function $\mathcal{M} : 2^{\mathcal{I}} \to \mathbb{R}_+$. We then define
$M_{\mathcal{M}}(x) : \mathbb{R}^M \to \mathbb{R}_+$ as $\mathcal{M}(\mathrm{supp}(x))$. We then use a prior of the form:



3 A/ 




l|p||o<i? 
|p||o = M 
else 



(45) 



Here $R$ is the rank of $X$ and $C > 1$ is such that the normalizing constant $H_{\mathcal{M}} < 4$. Note that
4 is an arbitrary constant used for the sake of consistency throughout the theory presented
here and in Rigollet and Tsybakov (2010).
Lemma 2. (From Leung and Barron (2006), Rigollet and Tsybakov (2010)) Consider the
sparsity pattern estimator with prior $\pi^{\mathcal{M}}$: $X\hat\beta^{\mathcal{M}}$, then:

$$E\|X\hat\beta^{\mathcal{M}} - y\|_2^2 \le \min_{p \in \{0,1\}^M:\ \pi^{\mathcal{M}}_p \neq 0} \left\{ E\|X\hat\beta_p - y\|_2^2 + \frac{4\sigma^2 \log\left(1/\pi^{\mathcal{M}}_p\right)}{n} \right\} \tag{46}$$



We now make the following assumption: 
Assumption 3. For all p E V where i? > ||p||o > 0; 



"^"^ <logfl+ ..f. ..J . (47) 



Mm{p) ~ \ max{MM{p),l 
Now, for p such that ||p||o < R, the following holds: 
Lemma 3. 

4a^log(7r^,^) ^ 8aH4Mip) A ^ eM \ ^ i^g2 (48) 

We now present the main general result: 

Proposition 5. For any M > 1, n > 1, the sparsity pattern estimator with prior ttj^: X(3 
satisfies: 

E||X/3 - t/lla < mm 



liiiii s 1 1 Xp — vWo + mm < , 9(J log 1 t —— — —-. — r 

^eMM\ii ^ ^112 I ^ , ^ max(M;K(/3),l) 

(49) 

Proof. For ||/3||o < -R, we know from combining Lemma |2| Lemma [sj and Assumption |3] that 
X/3 satisfies: 

EX/3 -y2< mm <^ X/3 - y ^ + ^^1^ log M + k + _ 

" ^ -^"^ - ^eRM||/3||o<ij\" ^ -^ll^ ^ max(M^(/3),l)y J n 

(50) 

For ||/3||o = M, we have: 

E||x3"^-y||2< min |||X/3 - y||^ + a^- j + — log2. (51) 



And so the proposition follows directly. □ 
B.2 Convex Norm Bounds

Lemma 4. For integer $M > 0$, define $\mathcal{I} = \{1, \ldots, M\}$. Suppose that we have a set function
$\mathcal{M} : 2^{\mathcal{I}} \to \mathbb{R}_+$ and norm $\|\cdot\|_{\mathcal{M}} : \mathbb{R}^M \to \mathbb{R}$. We then define $M_{\mathcal{M}}(x) : \mathbb{R}^M \to \mathbb{R}_+$ as
$\mathcal{M}(\mathrm{supp}(x))$. Then, if for any $\beta^* \in \mathbb{R}^M \setminus \{0\}$, any integer $k \ge 1$, and any function $f$ we
have:

||/^*||2 

min ||/-X/3||^ < . (52) 

^:||/3||Al = ||/9*||A<;M^(/3)<fc"-' mm{k, Mm{I3)) 

Then, for any C > 1, integer n > 0, constant u > 0, a given real number k* > 1, 
13* G ]R*^\{0}, and any function Mj^{x) that satisfies, for some 7 > 0; Mj^{x) < Mj^{x) < 
-fMM{x) Vx e R^; 

,2 , ,Mm{/3), a , eC 



<iw-.iii+.^5:i„gri+£gV«^. (53) 



n \ k* J k 
Proof We consider two cases, k* < Mm{(^*), and k* > Mm{^*) 
• Let A;* < M^(/3*). 

,2 , 2^A^(/3), A , eC 



min <^ ||X/3-y||^ + t/^ ^ '^'^M og 1 + = } < 

< min mm < ||Xp — y||2 + ^ ^og I 1 + 



l<fe<M^(/3*)/3eK*f:M;vi(/3)<fc t TT- V max(M_A4 (/3) , 1) 



(54) 



< min min < I |X/3 — y| U + log 1 + , , . 

i<fe<M^(^*)/3eM^:MAi(/3)<fe/7 t V niax(7M^ (/3) , 1 j 

(55) 



< min < min 

l<k<MM{l3*) l/3eK^:||/3||A4=||/3*||M;MM(/3)<fe/7 



(||X/3-y||^}+.^^log(l + f)} 



(56) 

<||Xr-y||^+ min ( + log f 1 + ^"l | (57) 

i<k<MM{f3*) I k n \ k J } 

<||X/3--y||l + .^|log(l + ^)+:^ (58) 

In the above, we use the monotonicity of the mapping $g(t) = \frac{t}{n}\log\left(1 + \frac{eC}{t}\right)$ for $t \ge 1$.
We apply the assumptions on $M_{\mathcal{M}}$ in the next steps. We finally use the fact that
$k^* \le M_{\mathcal{M}}(\beta^*)$ in the fourth step. This completes this case.
For $k^* > M_{\mathcal{M}}(\beta^*)$, we can use a simple argument:



,2 , 2Mm{P) , a , eC 



min <^ \\Xf3 - y\\i + ' ""^M og 1 + = )■ < 



< ||X/3*-y||^ + z/^ ^"^^ M og 1 + =^ (59) 

- " n H max(M^(/3*),l)y ^ ^ 

<||X/3*-y||^ + .^^log(l + ^)+«l (60) 

These two cases complete the proof. □ 

Note that Lemmas 5 and 6 give results that guarantee that the conditions of the above
lemma are satisfied in the two important cases considered in this paper. We may also use
$M_{\mathcal{M}}(\cdot)$ itself, with $\gamma = 1$, in place of its bounding function at all points in this lemma and
obtain the same result in terms of $M_{\mathcal{M}}(\cdot)$. In light of this result, we now give a generalized
version of Lemma 8.2 from Rigollet and Tsybakov (2010). The result of the lemma has been
simplified from the version in Rigollet and Tsybakov (2010), but the full result still holds.



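Proposition 6. Assume all of the conditions of Lemma 4. Then,

$$E\|X\hat\beta^{\mathcal{M}} - y\|_2^2 \le \min_{\beta \in \mathbb{R}^M} \left\{\|X\beta - y\|_2^2 + \phi_{n,M,C,\mathcal{M}}(\beta)\right\} + \frac{\sigma^2}{n}\left(9\log(1 + eM) + 8\log 2\right) \tag{61}$$

where $\phi_{n,M,C,\mathcal{M}}(0) := 0$ and for $\beta \neq 0$: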
4>n,M,c,M = min 



cr2 9a^MMif3)^^^r ^ _eC \ llaV7ll/3|U A A , ^eCa 



n n 



max(M^(/3),l); ' V V WPWmV^ 

(62) 



Proof. We first show the following: 



,2 , 2MMiP),_f-, , eC 



min <^ \\X(3 - y\\i + ' ' log 1 + = ) < 

n ^\ max(M^(/3),l);j - 

< mm {llX/3 - y||2 + (3 + l/e)0„,^,,^,^(/3)} . (63) 
Where 4>n,M,c,M^f^') = for /3 = and otherwise: 



V«,c,«(/3) = ^^V'"' V ^ mW^n) + (3+l/e)n ' '^^^ 



It is clear that Equation |63] holds for /3 = 0. For /3 7^ 0, we begin with the statement of 
Lemma HI 



,2 , 2^A^(/3), A , eC 



min <^ ||X/3-y||^ + t/' '"'^^M og 1 + — — H < (65) 

/3eKM\" ^ -^"^ n max(M;v((/3),l) ' ' - ^ ^ 



,2 , „2^%.„A , eC^^ , 711/3*11^ 



< I |X/3* - y 1 1^ + log 1 + - + (66) 
Then, using the proof of Lemma 8.2 in Rigollet and Tsybakov (2010), we can show:



+ - + l/e)0.,M,c,A.(/3*)- (67) 



We next lei a = v and combine Equation 63 with Lemma M to complete the proof. The 



constants are finally rounded up to the nearest integer for clarity as in Rigollet and Tsybakov
(2010). □



B.3 Lemmas for Specific Norms and Set Functions 

We now give two lemmas guaranteeing that the conditions in Lemma 4 are satisfied in
two particular settings. The following lemma is a notationally adapted version of Lemma 8.1
in Rigollet and Tsybakov (2010) and is given without proof:

Lemma 5. For any $\beta^* \in \mathbb{R}^M \setminus \{0\}$, any integer $k \ge 1$, $X$ such that $\max_{1 \le j \le M} \|x_j\|_2 \le 1$,
and any vector $y$:

$$\min_{\beta:\ \|\beta\|_1 = \|\beta^*\|_1;\ \|\beta\|_0 \le k} \|y - X\beta\|_2^2 \le \|y - X\beta^*\|_2^2 + \frac{\|\beta^*\|_1^2}{\min(k, \|\beta^*\|_0)} \tag{68}$$

We now give a version of this lemma for grouped $\ell_0$-like norms:

Lemma 6. For any $\beta^* \in \mathbb{R}^M \setminus \{0\}$, any integer $k \ge 1$, $X$ such that $\max_{1 \le j \le M} \|x_j\|_2 \le 1$,
and any vector $y$:

$$\min_{\beta:\ \|\beta\|_{1,\mathcal{G}} = \|\beta^*\|_{1,\mathcal{G}};\ \|\beta\|_{0,\mathcal{G}} \le k} \|y - X\beta\|_2^2 \le \|y - X\beta^*\|_2^2 + \frac{\|\beta^*\|_{1,\mathcal{G}}^2}{\min(k, \|\beta^*\|_{0,\mathcal{G}})} \tag{69}$$

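Proof. Fix $\beta^* \in \mathbb{R}^M \setminus \{0\}$, and integer $k \ge 1$. Set $K = \min(k, \|\beta^*\|_{0,\mathcal{G}})$. Let $V_{\mathcal{G}}(\beta^*) = \{v^*_g\}$
be a $\mathcal{G}$-decomposition of $\beta^*$ minimizing the norm $\|\beta^*\|_{1,\mathcal{G}}$. Then, define the multinomial
parameter, a $|\mathcal{G}|$ vector, $q = \{q_1, \ldots, q_{|\mathcal{G}|}\}$, with $q_g = \|v^*_g\|_2 / \|\beta^*\|_{1,\mathcal{G}}$. Let the $|\mathcal{G}|$-vector $\kappa$ have
multinomial distribution $\mathcal{M}(K, q)$. We then define the random vector $\tilde\beta \in \mathbb{R}^{\sum_{g \in \mathcal{G}} |g|}$, a
concatenation of $|\mathcal{G}|$ vectors: $[\tilde\beta_g]_{g \in \mathcal{G}}$, with components
$\tilde\beta_g = \kappa_g v^*_g \|\beta^*\|_{1,\mathcal{G}} / (K \|v^*_g\|_2)$. Here, we adopt the
convention that if $v^*_g / \|v^*_g\|_2 = 0/0$, then $v^*_g / \|v^*_g\|_2 = 0$ (note that $\kappa_g = 0$ in this case). Thus,
we have that $E\tilde\beta_g = v^*_g$, and $\mathrm{Var}(\kappa_g) = K q_g (1 - q_g)$. Now, we have that the entries on the
diagonal of the covariance matrix $\Sigma_g$ of $\tilde\beta_g$ are bounded as follows:

$$\mathrm{Diag}(\Sigma_g) \le \frac{\|\beta^*\|_{1,\mathcal{G}} \|v^*_g\|_2}{K}\, 1_{|g|} \tag{70}$$

Here $1_{|g|}$ is a $|g|$ length vector with entries all equal to 1. Now, let $\beta \in \mathbb{R}^M$ be such that
$V_{\mathcal{G}}(\beta) = \{v_g(\beta)\}$, where $v_g(\beta) \in \mathbb{R}^M$ is equal to $\tilde\beta_g$ for the indices in $g$, and equal to zero
otherwise.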
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Then, we define the following n x 'Ylig^g \9\ matrix: X = [xj : j G fi'jgeg, where Xj is the 
ith. column of X. Let f3* = [f3*g]g^g. Now, if maxjgi ||xj||2 < 1, then for any vector y we 



have: 



E||y-X/3||2 = E||y-X/3||; 



|y_X/3*||2 + -^XfSx 



i=l 
*||2 



< ||y-X/3*||2 + 



W\\\ 
K 



(71) 
(72) 

(73) 



In the above, $x_i$ is the $i$th row of $X$. It is clear that $\|\beta\|_{0,\mathcal{G}} \le K$. Further, by Corollary 1
from Jacob et al. (2009), $\|\beta\|_{1,\mathcal{G}} = \|\beta^*\|_{1,\mathcal{G}}$. The result then follows. □



B.4 Discussion 

The theoretical framework presented here leads us to postulate that there are many potential
aggregate estimators that give similar theoretical guarantees. However, the assumptions in
Lemma 4 play a key role. Beginning with a set function and its corresponding convex
extension, we could propose a prior that would give us an aggregate estimator that would
enjoy adaptation to patterns in terms of both the set function and the convex norm. In
light of the assumptions in Lemma 4, it is necessary to produce a general form of Lemmas 5
and 6, which give a bound on the approximation of $y$ when we restrict the approximating
functions to a class that depends on our set function. Such a result remains an open question,
but we suspect it is not attainable for all set functions. However, since such a result needs to
hold only for a set function (and its corresponding convex norm) that bounds our target
function within a constant factor, as in the conditions of Lemma 4, there may exist several
interesting extensions that we have not proposed in this paper.

[Table 1 about here.] 

[Table 2 about here.] 

[Table 3 about here.] 
[Figure 1 about here.] 
[Figure 2 about here.] 
[Figure 3 about here.] 
Figure 1: A structured sparsity setting for a linear structure in the coefficient vector:
adjacent entries of $\beta$ are considered close. We display the true linear sparsity pattern (top) and
recovered sparsity patterns by structured, sparse aggregation (second from top), sparse
aggregation (third from top), and cross-validated lasso (bottom). Black (0%) to white (100%)
indicates percentage of selection in the Markov Chain algorithm for the aggregation
estimators. For the lasso, black (0%) to white (100%) indicates the percentage of selection out
of 100 replications of cross validation. Structured sparse aggregation is able to best recover
the true sparsity pattern. Both sparse aggregation and the lasso suffer from false positives
scattered throughout the space of candidate covariates.
Figure 2: An example of structured sparsity in a two dimensional lattice: the coefficient
vector $\beta$ is an unraveled matrix. We display the true 2-D lattice sparsity pattern (top left)
and the recovered sparsity patterns by structured, sparse aggregation (top right), sparse
aggregation (bottom left), and cross-validated lasso (bottom right). Black (0%) to white
(100%) indicates percentage of selection in the Markov Chain algorithm for the aggregation
estimators. For the lasso, black (0%) to white (100%) indicates the percentage of selection
out of 100 replications of cross validation. Structured sparse aggregation is able to best
recover the true sparsity pattern. Both sparse aggregation and the lasso suffer from false
positives scattered throughout the lattice of candidate predictors.
Figure 3: Structured sparsity in an application: HIV drug resistance. The panels give the
selected sparsity patterns across HIV protein mutations for structured, sparse aggregation
(top), sparse aggregation (second from top), stepwise regression (third from top), and the
lasso (bottom). Each box represents a mutation covariate. The horizontal axis represents
location in the protein sequence. The locations (1 to 99) are arranged left to right as in the
protein sequence. The vertical axis has no spatial meaning; each stack represents the number
of mutations observed at that location in the protein sequence. Mutation predictors in
adjacent bands are from adjacent locations in the protein sequence. Since proteins typically
function via active sites, our structured model encourages clustered selection in the sequence.
For the aggregation methods, the color of the boxes indicates the percentage of selection in
the Markov Chain algorithm: black (0%) to white (100%). If a mutation is never selected,
it is gray and diagonally shaded. For the lasso, black (0%) to white (100%) indicates the
percentage of selection out of 100 replications of cross validation. For stepwise regression,
we only report the selection from a single instance of the algorithm.
Table 1: Simulation results for 1-dimensional linear sparsity patterns, see e.g. Figure 1.
Both sparse (SPA) and structured sparse (SSA) aggregation methods outperform the lasso
in terms of prediction and recovery of the true sparsity pattern. Note that for paired runs of
both aggregation methods on a single simulated data set, the structured estimator is superior
in both measures at least 95% of the time. For each measure, the mean over 250 trials is
reported with the standard error in parentheses.

(n, M, C, Con, $\sigma$)      Prediction (SPA)   Prediction (SSA)   Prediction (lasso)
(100, 100, 1, 9, 1)      0.168 (0.104)      0.123 (0.065)      0.813 (0.347)
(200, 500, 1, 20, 1.5)   0.371 (0.132)      0.298 (0.156)      2.765 (0.723)
(100, 100, 2, 5, 1.1)    0.156 (0.078)      0.138 (0.079)      1.012 (0.46)

(n, M, C, Con, $\sigma$)      Recovery (SPA)     Recovery (SSA)     Recovery (lasso)
(100, 100, 1, 9, 1)      0.018 (0.01)       0.014 (0.006)      0.089 (0.035)
(200, 500, 1, 20, 1.5)   0.019 (0.007)      0.015 (0.007)      0.14 (0.034)
(100, 100, 2, 5, 1.1)    0.016 (0.007)      0.014 (0.007)      0.104 (0.044)
Table 2: Simulation results for 2-dimensional lattice sparsity patterns, see e.g. Figure 2.
Both sparse (SPA) and structured sparse (SSA) aggregation methods outperform the lasso
in terms of prediction and recovery of the true sparsity pattern. Note that for paired runs of
both aggregation methods on a single simulated data set, the structured estimator is superior
in both measures at least 95% of the time. For each measure, the mean over 250 trials is
reported with the standard error in parentheses.

(n, M, C, Con, $\sigma$)      Prediction (SPA)   Prediction (SSA)   Prediction (lasso)
(100, 100, 1, 9, 1)      0.131 (0.063)      0.113 (0.059)      0.844 (0.398)
(200, 400, 2, 9, 1.4)    0.265 (0.088)      0.214 (0.085)      2.061 (0.616)

(n, M, C, Con, $\sigma$)      Recovery (SPA)     Recovery (SSA)     Recovery (lasso)
(100, 100, 1, 9, 1)      0.015 (0.006)      0.013 (0.006)      0.094 (0.043)
(200, 400, 2, 9, 1.4)    0.015 (0.004)      0.012 (0.004)      0.115 (0.032)
Table 3: Comparison of predictive power for the HIV data. We estimated the testing errors
using three fold data splitting. We see that the structured aggregation estimator gives
comparable predictive performance to the sparse aggregation estimator. The structured
estimator also carries the benefit of better biological interpretability. The mean test error is
given with the standard errors in parentheses.

Data Splitting                                  Sparse        Structured    Stepwise
Test Error                                      Aggregation   Aggregation   Regression    lasso
$\|y_{test} - X_{test}\hat\beta\|_2^2$          0.65 (0.04)   0.69 (0.07)   3.03 (0.18)   1.45 (0.2)